[Header image: pics.png]

MSBA 326 Final Project: Customer Segmentation using K Means Clustering

In [1]:
# Importing libraries                  
from __future__ import division
from IPython.display import Image
import warnings
warnings.filterwarnings('ignore')
In [2]:
import numpy as np
import matplotlib.pyplot as plt  
%matplotlib inline
import pandas as pd 
from pandas import read_csv
import seaborn as sb
from textwrap import wrap
import statsmodels.api as sm

from datetime import datetime, timedelta,date
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.cluster import KMeans
In [3]:
import plotly as py
import plotly.express as px
import plotly.offline as pyoff
import plotly.graph_objs as go
import xgboost as xgb
In [4]:
pd.options.mode.chained_assignment = None  # default='warn'
In [5]:
# Loading the data
filename = 'customer_segmentation.csv'
df = read_csv(filename, encoding="ISO-8859-1")

Data Exploration and Preparation

In [6]:
df.head(3)
Out[6]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850.0 United Kingdom
In [7]:
df.isna().sum()
Out[7]:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
In [8]:
df = df.dropna()
In [9]:
df.isna().sum()
Out[9]:
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
In [10]:
df.dtypes
Out[10]:
InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object
In [11]:
#converting the type of Invoice Date Field from string to datetime.
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

#creating YearMonth field for the ease of reporting and visualization
df['InvoiceYearMonth'] = df['InvoiceDate'].map(lambda date: 100*date.year + date.month)
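As a quick standalone sanity check (toy dates, not from the dataset), the `100*year + month` encoding used above produces integers that sort chronologically:

```python
from datetime import datetime

# Toy dates: the encoding packs e.g. September 2011 into the integer 201109,
# so later months always get larger codes.
dates = [datetime(2010, 12, 1), datetime(2011, 9, 2)]
codes = [100 * d.year + d.month for d in dates]
print(codes)  # [201012, 201109]
```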
In [12]:
df.dtypes
Out[12]:
InvoiceNo                   object
StockCode                   object
Description                 object
Quantity                     int64
InvoiceDate         datetime64[ns]
UnitPrice                  float64
CustomerID                 float64
Country                     object
InvoiceYearMonth             int64
dtype: object
In [13]:
# Let's see which country gets maximum number of orders
fig = px.histogram(df, x="Country")
fig.update_layout(xaxis = go.layout.XAxis(tickangle = 45))
fig.show()

As per the graph above, the United Kingdom accounts for the majority of sales. This is not surprising given that the store is based in the UK.

Since the majority of the data is from the UK, we will segment the market and perform all our analyses using UK data only.

Creating a new dataframe with UK records

In [14]:
ukdf=df.loc[df['Country'] == 'United Kingdom']

Finding Outliers

In [15]:
# Creating a new column for Total amount per transaction
ukdf['Revenue'] = ukdf['Quantity']*ukdf['UnitPrice']
In [16]:
# Taking a look at negative values.
NegTrans = ukdf[ukdf['Revenue'] < 0]
NegTrans.head(5)
Out[16]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth Revenue
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom 201012 -27.50
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311.0 United Kingdom 201012 -4.65
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548.0 United Kingdom 201012 -19.80
236 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom 201012 -6.96
237 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0 United Kingdom 201012 -6.96

A look at the Quantity, UnitPrice, and Revenue values also reveals some negative values. These can be interpreted as refunds. As the quantity and revenue of each refund cancel out against the original purchase, we will leave them as they are.
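As a standalone sketch (toy rows modeled on the dataset's convention that refund invoices carry a leading "C" and a negative Quantity), a refund and its original purchase net to zero revenue:

```python
import pandas as pd

# Toy purchase/refund pair (hypothetical values) mirroring the dataset's
# convention: refunds appear as credit invoices (InvoiceNo prefixed 'C')
# with a negative Quantity.
toy = pd.DataFrame({
    'InvoiceNo': ['536391', 'C536391'],
    'Quantity':  [12, -12],
    'UnitPrice': [1.65, 1.65],
})
toy['Revenue'] = toy['Quantity'] * toy['UnitPrice']

refunds = toy[toy['Revenue'] < 0]
print(refunds['InvoiceNo'].str.startswith('C').all())  # True
print(toy['Revenue'].sum())                            # 0.0
```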

In [17]:
# Dropping unnecessary column
ukdf = ukdf.drop(['StockCode'], axis =1)
In [18]:
ukdf.shape
Out[18]:
(361878, 9)
In [19]:
# Descriptive Analysis before removing outliers
ukdf[['Quantity', 'UnitPrice','Revenue']].describe()
Out[19]:
Quantity UnitPrice Revenue
count 361878.000000 361878.000000 361878.000000
mean 11.077029 3.256007 18.702086
std 263.129266 70.654731 451.918484
min -80995.000000 0.000000 -168469.600000
25% 2.000000 1.250000 3.750000
50% 4.000000 1.950000 10.200000
75% 12.000000 3.750000 17.700000
max 80995.000000 38970.000000 168469.600000

1. Unit Price

In [20]:
ukdf.groupby('Description').mean()['UnitPrice'].nlargest()
Out[20]:
Description
DOTCOM POSTAGE                    744.147500
PICNIC BASKET WICKER 60 PIECES    649.500000
CRUK Commission                   495.839375
Manual                            232.239948
POSTAGE                           211.128023
Name: UnitPrice, dtype: float64

A look at the highest mean values brings up some suspect "products". We will dig deeper into these items to detect outliers.

In [21]:
df[df['Description']== 'DOTCOM POSTAGE']['UnitPrice'].describe()
Out[21]:
count      16.000000
mean      744.147500
std       540.829708
min        11.170000
25%       389.082500
50%       715.850000
75%       998.635000
max      1599.260000
Name: UnitPrice, dtype: float64
In [22]:
df[df['Description']== 'Manual']['UnitPrice'].describe()
Out[22]:
count      465.000000
mean       334.425634
std       1935.070401
min          0.000000
25%          0.850000
50%          3.000000
75%        106.400000
max      38970.000000
Name: UnitPrice, dtype: float64
In [23]:
df[df['Description']== 'CRUK Commission'].head()
Out[23]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth
317508 C564763 CRUK CRUK Commission -1 2011-08-30 10:49:00 1.60 14096.0 United Kingdom 201108
324023 C565382 CRUK CRUK Commission -1 2011-09-02 15:45:00 13.01 14096.0 United Kingdom 201109
333779 C566216 CRUK CRUK Commission -1 2011-09-09 15:17:00 15.96 14096.0 United Kingdom 201109
338848 C566565 CRUK CRUK Commission -1 2011-09-13 12:32:00 52.24 14096.0 United Kingdom 201109
351003 C567655 CRUK CRUK Commission -1 2011-09-21 14:40:00 608.66 14096.0 United Kingdom 201109
In [24]:
df[df['Description']== 'Discount'].head()
Out[24]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0 United Kingdom 201012
9038 C537164 D Discount -1 2010-12-05 13:21:00 29.29 14527.0 United Kingdom 201012
14498 C537597 D Discount -1 2010-12-07 12:34:00 281.00 15498.0 United Kingdom 201012
19392 C537857 D Discount -1 2010-12-08 16:00:00 267.12 17340.0 United Kingdom 201012
31134 C538897 D Discount -1 2010-12-15 09:14:00 5.76 16422.0 United Kingdom 201012

We found the following 4 items that have unusual unit prices and do not appear to be actual products:

  1. DOTCOM POSTAGE - seems to indicate the amount spent by the customer on postage

  2. CRUK Commission - fee paid out to an external cancer research organization

  3. Manual - refers to manual services rendered with the purchase like furniture assembly

  4. Discount

None of these items is a direct indicator of sales, and they heavily skew the data, so we will remove them from our dataset.

In [25]:
removeitems = ['DOTCOM POSTAGE', 'CRUK Commission', 'Manual','Discount']
ukdf = ukdf[~ukdf['Description'].isin(removeitems)]
In [26]:
# Let's check if we still have any rows with unusual unitprice i.e. greater than 2K
ukdf[ukdf.UnitPrice > 2000]
Out[26]:
InvoiceNo Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth Revenue
173277 C551685 POSTAGE -1 2011-05-03 12:51:00 8142.75 16029.0 United Kingdom 201105 -8142.75
173382 551697 POSTAGE 1 2011-05-03 13:46:00 8142.75 16029.0 United Kingdom 201105 8142.75
In [27]:
removeinvoice1 = ['C551685', '551697']
ukdf = ukdf[~ukdf['InvoiceNo'].isin(removeinvoice1)]
In [28]:
# So, we deleted 493 rows, but since we still have about 361K rows of data it should be enough.
ukdf.shape
Out[28]:
(361385, 9)
In [29]:
ukdf.head()
Out[29]:
InvoiceNo Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth Revenue
0 536365 WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 201012 15.30
1 536365 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 201012 20.34
2 536365 CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 201012 22.00
3 536365 KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 201012 20.34
4 536365 RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 201012 20.34

Let's plot frequency distribution of some of the relevant columns

1. UnitPrice

In [30]:
fig = px.histogram(ukdf, x="UnitPrice",
                 labels={
                     "UnitPrice": "UnitPrice",
                     },
                title="UnitPrice Frequency")
fig.show()
In [31]:
# As we see most of the unit prices are under 50, we will take a closer look
ukdf1=ukdf.query('UnitPrice < 50')['UnitPrice']
fig = px.histogram(ukdf1, x="UnitPrice",
                 labels={
                     "UnitPrice": "UnitPrice",
                     },
                title="UnitPrice Frequency")
fig.show()

2. Quantity

By running a quick filter, we see that there are only 106 rows with quantities above 1000 or below -1000. On closer inspection, these are mostly small-ticket items, and most of them were returned, indicating they were bought by mistake. Thus we will remove these outliers.

In [32]:
ukdf[(ukdf['Quantity'] > 1000) | (ukdf['Quantity'] < -1000)]
Out[32]:
InvoiceNo Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceYearMonth Revenue
4287 C536757 ROTATING SILVER ANGELS T-LIGHT HLDR -9360 2010-12-02 14:23:00 0.03 15838.0 United Kingdom 201012 -280.80
4850 536809 ASSORTED COLOUR T-LIGHT HOLDER 1824 2010-12-02 16:48:00 0.55 15299.0 United Kingdom 201012 1003.20
4945 536830 WORLD WAR 2 GLIDERS ASSTD DESIGNS 2880 2010-12-02 17:38:00 0.18 16754.0 United Kingdom 201012 518.40
4946 536830 RED HARMONICA IN BOX 1400 2010-12-02 17:38:00 1.06 16754.0 United Kingdom 201012 1484.00
16435 537659 CREAM HEART CARD HOLDER 1008 2010-12-07 16:43:00 2.31 18102.0 United Kingdom 201012 2328.48
... ... ... ... ... ... ... ... ... ...
533812 581115 METAL SIGN TAKE IT OR LEAVE IT 1404 2011-12-07 12:20:00 2.75 15195.0 United Kingdom 201112 3861.00
540070 581458 POPCORN HOLDER 1500 2011-12-08 18:45:00 0.72 17949.0 United Kingdom 201112 1080.00
540071 581459 POPCORN HOLDER 1200 2011-12-08 18:46:00 0.72 17949.0 United Kingdom 201112 864.00
540421 581483 PAPER CRAFT , LITTLE BIRDIE 80995 2011-12-09 09:15:00 2.08 16446.0 United Kingdom 201112 168469.60
540422 C581484 PAPER CRAFT , LITTLE BIRDIE -80995 2011-12-09 09:27:00 2.08 16446.0 United Kingdom 201112 -168469.60

106 rows × 9 columns

In [33]:
# We will remove all the quantities above 1K and their corresponding return transaction of less than -1K.
ukdf = ukdf[(ukdf['Quantity'] > -1000)]
In [34]:
ukdf = ukdf[(ukdf['Quantity'] < 1000)]
In [35]:
ukdf2 = ukdf[ukdf.Quantity > 0]        # leaving out negative values as they are return items
fig = px.histogram(ukdf2, x="Quantity",
                 labels={
                     "Quantity": "Quantity",
                     },
                title="Quantity Frequency")
fig.show()
In [36]:
# Most quantities lie below 100, so we take a closer look at that range of the histogram
ukdf3=ukdf2.query('Quantity < 100')['Quantity']
fig = px.histogram(ukdf3, x="Quantity",
                 labels={
                     "Quantity": "Quantity",
                     },
                title="Quantity Frequency")
fig.show()

Finding top ten selling items by their total sales

In [37]:
sales_order = ukdf.groupby('Description').sum()['Revenue'].nlargest(10)

plt.figure(figsize = (30,10))
ax = sb.barplot(x = sales_order.index, y = sales_order.values, palette = 'viridis')
ax.set_xlabel('Product Description', fontsize = 20)
ax.set_ylabel('Total Sales', fontsize = 20)
ax.set_title('Top 10 Selling Products', fontsize = 30)

labels = [ '\n'.join(wrap(l, 15)) for l in sales_order.index ]
ax.set_xticklabels(labels, fontsize = 15)

value_labels = []
for x in sales_order.values:
    value_labels.append(str(int(x/1000))+' k')

for p, label in zip(ax.patches, value_labels):
    ax.annotate(label, (p.get_x() + 0.26, p.get_height() + 2), fontsize = 15)

It seems the 3-tier cakestand is the highest revenue-generating item.

In [38]:
ukdf.shape   # Even after cleaning up the outliers, we still have about 361K records, so we should be good to go.
Out[38]:
(361269, 9)
In [39]:
# Descriptive Analysis of relevant columns after removing outliers
ukdf[['Quantity', 'UnitPrice','Revenue']].describe()
Out[39]:
Quantity UnitPrice Revenue
count 361269.000000 361269.000000 361269.000000
mean 10.664627 2.900097 18.549886
std 30.077532 4.637613 89.692975
min -960.000000 0.000000 -4522.500000
25% 2.000000 1.250000 3.750000
50% 4.000000 1.950000 10.200000
75% 12.000000 3.750000 17.700000
max 992.000000 649.500000 38970.000000

Customer segmentation using RFM Clustering

In [40]:
#creating a generic user dataframe to keep CustomerID and new segmentation scores
ukdf_user = pd.DataFrame(ukdf['CustomerID'].unique())
ukdf_user.columns = ['CustomerID']
ukdf_user.head()
Out[40]:
CustomerID
0 17850.0
1 13047.0
2 13748.0
3 15100.0
4 15291.0

Recency Analysis

In [41]:
# Creating a dataframe with max purchase date for each customer 
uk_max_purchase = ukdf.groupby('CustomerID').InvoiceDate.max().reset_index()
uk_max_purchase.columns = ['CustomerID','MaxPurchaseDate']
uk_max_purchase.head(3)
Out[41]:
CustomerID MaxPurchaseDate
0 12747.0 2011-12-07 14:34:00
1 12748.0 2011-12-09 12:20:00
2 12749.0 2011-12-06 09:56:00
In [42]:
# Comparing the last transaction of the dataset with last transaction dates of the individual customer IDs.
uk_max_purchase['Recency'] = (uk_max_purchase['MaxPurchaseDate'].max() - uk_max_purchase['MaxPurchaseDate']).dt.days
uk_max_purchase.head()
Out[42]:
CustomerID MaxPurchaseDate Recency
0 12747.0 2011-12-07 14:34:00 1
1 12748.0 2011-12-09 12:20:00 0
2 12749.0 2011-12-06 09:56:00 3
3 12820.0 2011-12-06 15:12:00 2
4 12821.0 2011-05-09 15:51:00 213
In [43]:
#merging this dataframe to the new user dataframe
ukdf_user = pd.merge(ukdf_user, uk_max_purchase[['CustomerID','Recency']], on='CustomerID')
ukdf_user.head()
Out[43]:
CustomerID Recency
0 17850.0 301
1 13047.0 45
2 13748.0 95
3 15100.0 329
4 15291.0 25
In [44]:
# Plotting Recency histogram
fig = px.histogram(ukdf_user, x="Recency",
                 labels={
                     "Recency": "Last Purchase days",
                     },
                title="Recency")
fig.show()

We will now apply K-means clustering to assign a recency score. We start by specifying the number of clusters the K-means algorithm needs. To find it, we apply the elbow method, which suggests the optimal number of clusters based on how inertia decreases as k grows. The code snippet and inertia graph are as follows:

In [45]:
from sklearn.cluster import KMeans

sse = {}  # sum of squared errors (inertia) per k
uk_recency = ukdf_user[['Recency']]
for k in range(1, 10):
    # fit on the Recency column only, so earlier cluster labels don't leak into later fits
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(uk_recency[['Recency']])
    uk_recency["clusters"] = kmeans.labels_  # cluster labels corresponding to recency values
    sse[k] = kmeans.inertia_  # inertia for this number of clusters
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()
In [46]:
# 3 looks like an optimal number. We will expand our scope slightly and use 4 clusters for further analysis.
In [47]:
#building 4 clusters for recency and add it to dataframe
kmeans = KMeans(n_clusters=4)
ukdf_user['RecencyCluster'] = kmeans.fit_predict(ukdf_user[['Recency']])
In [48]:
ukdf_user.head()
Out[48]:
CustomerID Recency RecencyCluster
0 17850.0 301 1
1 13047.0 45 3
2 13748.0 95 0
3 15100.0 329 1
4 15291.0 25 3
In [49]:
ukdf_user.groupby('RecencyCluster')['Recency'].describe()
Out[49]:
count mean std min 25% 50% 75% max
RecencyCluster
0 953.0 77.713536 22.969307 48.0 59.0 72.0 93.0 132.0
1 461.0 306.819957 40.391402 247.0 269.0 302.0 351.0 373.0
2 577.0 187.129983 32.548982 133.0 158.0 185.0 214.0 246.0
3 1944.0 17.524177 13.235872 0.0 6.0 16.0 28.0 47.0

Right now the cluster labels assigned by K-means are arbitrary, so the cluster numbers carry no order: for example, cluster 2 customers are more recent than cluster 1 but less recent than clusters 0 and 3. Let us order the clusters by most recent transactions, using the mean recency value of each cluster.

In [50]:
# Building the function 
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

ukdf_user = order_cluster('RecencyCluster', 'Recency',ukdf_user,False)
In [51]:
ukdf_user.head()
Out[51]:
CustomerID Recency RecencyCluster
0 17850.0 301 0
1 15100.0 329 0
2 18074.0 373 0
3 16250.0 260 0
4 13747.0 373 0
In [52]:
ukdf_user.groupby('RecencyCluster')['Recency'].describe()
Out[52]:
count mean std min 25% 50% 75% max
RecencyCluster
0 461.0 306.819957 40.391402 247.0 269.0 302.0 351.0 373.0
1 577.0 187.129983 32.548982 133.0 158.0 185.0 214.0 246.0
2 953.0 77.713536 22.969307 48.0 59.0 72.0 93.0 132.0
3 1944.0 17.524177 13.235872 0.0 6.0 16.0 28.0 47.0
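To make the relabeling concrete, here is a minimal standalone sketch of the same idea on toy data (hypothetical cluster labels): clusters are ranked by mean recency in descending order, so label 0 becomes the least recent group.

```python
import pandas as pd

# Toy data with arbitrary cluster ids, as K-means would assign them.
toy = pd.DataFrame({
    'Recency':        [300, 10, 150, 320, 12, 160],
    'RecencyCluster': [2, 0, 1, 2, 0, 1],
})

# Rank clusters by mean Recency, descending: highest mean -> new label 0.
order = (toy.groupby('RecencyCluster')['Recency'].mean()
            .sort_values(ascending=False).index)
toy['RecencyCluster'] = toy['RecencyCluster'].map(
    {old: new for new, old in enumerate(order)})
print(toy.groupby('RecencyCluster')['Recency'].mean())
```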

Frequency Analysis

In [53]:
# To create frequency clusters, we need to find total number of orders for each customer. 
#get order counts for each user and create a dataframe with it
uk_frequency = ukdf.groupby('CustomerID').InvoiceDate.count().reset_index()
uk_frequency.columns = ['CustomerID','Frequency']
In [54]:
uk_frequency.head() # number of orders per customer
Out[54]:
CustomerID Frequency
0 12747.0 103
1 12748.0 4625
2 12749.0 231
3 12820.0 59
4 12821.0 6
In [55]:
# We will add this data to our generic dataframe

ukdf_user = pd.merge(ukdf_user, uk_frequency, on='CustomerID')

ukdf_user.head()
Out[55]:
CustomerID Recency RecencyCluster Frequency
0 17850.0 301 0 312
1 15100.0 329 0 6
2 18074.0 373 0 13
3 16250.0 260 0 24
4 13747.0 373 0 1
In [56]:
# Plotting Frequency histogram
fig = px.histogram(ukdf_user, x="Frequency",
                 labels={
                     "Frequency": "Frequency of purchase",
                     },
                title="Frequency")
fig.show()
In [57]:
# Most frequencies are below 1000. Visualizing frequencies below 1000.
df2=ukdf_user.query('Frequency < 1000')['Frequency']
fig = px.histogram(df2, x="Frequency",
                 labels={
                     "Frequency": "Frequency of purchase",
                     },
                title="Frequency")
fig.show()
In [58]:
# Frequency clusters

# Determining the right number of clusters for K-Means by elbow method
from sklearn.cluster import KMeans

sse = {}  # sum of squared errors (inertia) per k
uk_freq_feature = ukdf_user[['Frequency']]
for k in range(1, 10):
    # fit on the Frequency column only, so earlier cluster labels don't leak into later fits
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(uk_freq_feature[['Frequency']])
    uk_freq_feature["clusters"] = kmeans.labels_  # cluster labels corresponding to frequency values
    sse[k] = kmeans.inertia_  # inertia for this number of clusters
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()
In [59]:
# Applying k-Means

kmeans=KMeans(n_clusters=4)
ukdf_user['FrequencyCluster']=kmeans.fit_predict(ukdf_user[['Frequency']])

#ordering the frequency cluster
ukdf_user = order_cluster('FrequencyCluster', 'Frequency', ukdf_user, True )
ukdf_user.groupby('FrequencyCluster')['Frequency'].describe()
Out[59]:
count mean std min 25% 50% 75% max
FrequencyCluster
0 3481.0 49.650675 44.899483 1.0 15.00 33.0 73.00 190.0
1 429.0 330.799534 133.693005 191.0 228.00 287.0 395.00 803.0
2 22.0 1310.500000 507.013830 871.0 975.75 1139.5 1451.75 2780.0
3 3.0 5897.000000 1811.444727 4625.0 4860.00 5095.0 6533.00 7971.0
In [60]:
# Cluster 3 has the highest purchase frequency; cluster 0 has the lowest.

Monetary Value

In [61]:
# We will now cluster our customers based on revenue. 
#calculate revenue for each customer
uk_revenue = ukdf.groupby('CustomerID').Revenue.sum().reset_index()
In [62]:
uk_revenue.head()
Out[62]:
CustomerID Revenue
0 12747.0 4196.01
1 12748.0 30410.53
2 12749.0 3868.20
3 12820.0 942.34
4 12821.0 92.72
In [63]:
# merging uk_revenue with our main dataframe
ukdf_user = pd.merge(ukdf_user, uk_revenue, on='CustomerID')
In [64]:
ukdf_user.head(3)
Out[64]:
CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue
0 17850.0 301 0 312 1 5288.63
1 15808.0 305 0 210 1 3724.77
2 13047.0 45 3 194 1 3089.10
In [65]:
# Visualizing with the histogram on a reduced scale 
#plot the histogram
df4=ukdf_user.query('Revenue < 10000')['Revenue']
fig = px.histogram(df4, x="Revenue",
                 labels={
                     "Revenue": "Monetary value of purchase",
                     },
                title="Revenue")
fig.show()
In [66]:
# Using elbow method to find out the optimum number of clusters for K-Means

from sklearn.cluster import KMeans

sse = {}  # sum of squared errors (inertia) per k
uk_revenue_feature = ukdf_user[['Revenue']]
for k in range(1, 10):
    # fit on the Revenue column only, so earlier cluster labels don't leak into later fits
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(uk_revenue_feature[['Revenue']])
    uk_revenue_feature["clusters"] = kmeans.labels_  # cluster labels corresponding to revenue values
    sse[k] = kmeans.inertia_  # inertia for this number of clusters
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()
In [67]:
# The elbow method suggests 3 or 4 clusters are optimal. We will use 4 clusters.

#apply clustering
kmeans = KMeans(n_clusters=4)
ukdf_user['RevenueCluster'] = kmeans.fit_predict(ukdf_user[['Revenue']])

#order the cluster numbers
ukdf_user = order_cluster('RevenueCluster', 'Revenue',ukdf_user,True)

#show details of the dataframe
ukdf_user.groupby('RevenueCluster')['Revenue'].describe()
Out[67]:
count mean std min 25% 50% 75% max
RevenueCluster
0 3673.0 907.251158 913.528992 -1192.20 264.6500 571.200 1253.2200 4330.67
1 234.0 7776.389957 3650.351991 4345.50 5182.2450 6529.380 9212.1675 24419.14
2 26.0 43519.356538 15555.329820 25758.10 30637.2675 39976.360 53020.8450 88136.03
3 2.0 208993.375000 65063.673867 162986.41 185989.8925 208993.375 231996.8575 255000.34
In [68]:
# It seems that cluster 3 has max revenue and cluster 0 has lowest revenue
In [69]:
# Overall scores

# Now we have scores (cluster numbers) for recency, frequency & revenue. We will now create an overall score.
# Calculating overall score and use mean() to see details
ukdf_user['OverallScore'] = ukdf_user['RecencyCluster'] + ukdf_user['FrequencyCluster'] + ukdf_user['RevenueCluster']
ukdf_user.groupby('OverallScore')[['Recency','Frequency','Revenue']].mean()
Out[69]:
Recency Frequency Revenue
OverallScore
0 307.039387 21.984683 300.277396
1 187.975482 32.861646 511.108268
2 78.841132 47.075082 841.082984
3 20.811296 68.452492 1108.762625
4 14.876667 269.900000 3681.182733
5 9.794521 377.273973 8786.118767
6 7.740741 871.888889 22885.218889
7 1.857143 1270.142857 100250.702857
8 1.333333 5897.000000 41334.060000
In [70]:
# It seems that customers with score 7 are our best customers, and those with score 0 are the least attractive.
In [71]:
# Analysing low value and high value customers 
ukdf_user['Segment'] = 'Low-Value'
ukdf_user.loc[ukdf_user['OverallScore']>2,'Segment'] = 'Mid-Value' 
ukdf_user.loc[ukdf_user['OverallScore']>4,'Segment'] = 'High-Value' 
In [72]:
ukdf_user.head()
Out[72]:
CustomerID Recency RecencyCluster Frequency FrequencyCluster Revenue RevenueCluster OverallScore Segment
0 17850.0 301 0 312 1 5288.63 1 2 Low-Value
1 14688.0 7 3 359 1 5107.38 1 5 High-Value
2 13767.0 1 3 399 1 16945.71 1 5 High-Value
3 15513.0 30 3 314 1 14520.08 1 5 High-Value
4 14849.0 21 3 392 1 7904.28 1 5 High-Value

Visualizing segments with scatter plots

In [73]:
#Revenue vs Frequency
uk_graph = ukdf_user.query("Revenue < 10000 and Frequency < 1000")

plot_data = [
    go.Scatter(
        x=uk_graph.query("Segment == 'Low-Value'")['Frequency'],
        y=uk_graph.query("Segment == 'Low-Value'")['Revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=uk_graph.query("Segment == 'Mid-Value'")['Frequency'],
        y=uk_graph.query("Segment == 'Mid-Value'")['Revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=uk_graph.query("Segment == 'High-Value'")['Frequency'],
        y=uk_graph.query("Segment == 'High-Value'")['Revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "Revenue"},
        xaxis= {'title': "Frequency"},
        title='Segments'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [74]:
#Revenue vs Recency

uk_graph = ukdf_user.query("Revenue < 10000 and Frequency < 1000")

plot_data = [
    go.Scatter(
        x=uk_graph.query("Segment == 'Low-Value'")['Recency'],
        y=uk_graph.query("Segment == 'Low-Value'")['Revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=uk_graph.query("Segment == 'Mid-Value'")['Recency'],
        y=uk_graph.query("Segment == 'Mid-Value'")['Revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=uk_graph.query("Segment == 'High-Value'")['Recency'],
        y=uk_graph.query("Segment == 'High-Value'")['Revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "Revenue"},
        xaxis= {'title': "Recency"},
        title='Segments'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show(renderer="notebook")
In [75]:
# Recency vs Frequency Visualization 

uk_graph = ukdf_user.query("Revenue < 50000 and Frequency < 2000")

plot_data = [
    go.Scatter(
        x=uk_graph.query("Segment == 'Low-Value'")['Recency'],
        y=uk_graph.query("Segment == 'Low-Value'")['Frequency'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=uk_graph.query("Segment == 'Mid-Value'")['Recency'],
        y=uk_graph.query("Segment == 'Mid-Value'")['Frequency'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=uk_graph.query("Segment == 'High-Value'")['Recency'],
        y=uk_graph.query("Segment == 'High-Value'")['Frequency'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "Frequency"},
        xaxis= {'title': "Recency"},
        title='Segments'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show(renderer="notebook")

We used the K-means clustering algorithm to create customer clusters that we will use for segmenting our market in the United Kingdom.

The clusters are built with the Recency, Frequency, and Monetary value (RFM) framework, which is popular in marketing. From the analysis and visualizations above, we now know which customers we need to retain (high value) and which segment does not require our attention (low value).
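The RFM feature construction behind these clusters condenses to a few lines of pandas; here is a standalone sketch on toy transactions (hypothetical values; the K-means scoring step is omitted):

```python
import pandas as pd

# Toy transaction log (hypothetical values).
tx = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 3],
    'InvoiceDate': pd.to_datetime(['2011-11-01', '2011-12-01', '2011-06-01',
                                   '2011-12-08', '2011-12-09', '2011-01-15']),
    'Revenue': [20.0, 10.0, 40.0, 5.0, 5.0, 100.0],
})

snapshot = tx['InvoiceDate'].max()  # last transaction date in the data
rfm = tx.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=('InvoiceDate', 'count'),                            # number of transactions
    Monetary=('Revenue', 'sum'),                                   # total revenue
)
print(rfm)
```

Each of the three columns would then be clustered and score-ordered exactly as in the sections above.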

We acknowledge the work of Shailaja Gupta, which we drew on for our study. Reference: Gupta, S. (n.d.). Customer Segmentation: RFM Clustering. Retrieved from https://kaggle.com/shailaja4247/customer-segmentation-rfm-clustering

The picture at the top is taken from an online article. Reference: Ong, A. (2020, Mar 16). Segmenting Customers using K-Means, RFM and Transaction Records. Retrieved from https://towardsdatascience.com/segmenting-customers-using-k-means-and-transaction-records-76f4055d856a

                                 *** End of the Project ***